Column-oriented DBMS

A column-oriented DBMS is a database management system (DBMS) that stores its content by column rather than by row. This has advantages for data warehouses and library catalogues where aggregates are computed over large numbers of similar data items.

It is possible to achieve some of the benefits of both column-oriented and row-oriented organization in any database. Denoting a DBMS as column-oriented refers both to the ease with which column-oriented structures are expressed and to its focus on optimizations for column-oriented workloads.[1][2] This approach contrasts with row-oriented (row store) databases and with correlation databases, which use a value-based storage structure.

Description

A relational database program must present its data as two-dimensional tables of columns and rows, but must store it as a one-dimensional sequence of bytes. For example, a database might have this table.

EmpId  Lastname  Firstname  Salary
1      Smith     Joe        40000
2      Jones     Mary       50000
3      Johnson   Cathy      44000

This simple table includes an employee identifier (EmpId), name fields (Lastname and Firstname) and a salary (Salary).

This table exists in the computer's memory (RAM) and storage (hard drive). Although RAM and hard drives differ mechanically, the computer's operating system abstracts over the difference. Still, the database must coax its two-dimensional table into a one-dimensional series of bytes for the operating system to write to RAM, to the hard drive, or to both.

A row-oriented database serializes all of the values in a row together, then the values in the next row, and so on.

      1,Smith,Joe,40000;
      2,Jones,Mary,50000;
      3,Johnson,Cathy,44000;

A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on.

      1,2,3;
      Smith,Jones,Johnson;
      Joe,Mary,Cathy;
      40000,50000,44000;
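
The following toy sketch (in Python, purely for illustration; a real DBMS stores binary pages rather than delimited text) produces the two serializations shown above from the same in-memory table:

    table = [
        (1, "Smith", "Joe", 40000),
        (2, "Jones", "Mary", 50000),
        (3, "Johnson", "Cathy", 44000),
    ]

    # Row-oriented serialization: all values of one record, then the next record.
    row_store = ";".join(",".join(str(v) for v in row) for row in table) + ";"
    print(row_store)      # 1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;

    # Column-oriented serialization: all values of one column, then the next column.
    columns = list(zip(*table))   # transpose the rows into columns
    column_store = ";".join(",".join(str(v) for v in col) for col in columns) + ";"
    print(column_store)   # 1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000;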

This is a simplification. Partitioning, indexing, caching, views, OLAP cubes, and transactional mechanisms such as write-ahead logging or multiversion concurrency control all dramatically affect the physical organization. That said, online transaction processing (OLTP)-focused RDBMSs are more row-oriented, while online analytical processing (OLAP)-focused systems are a balance of row-oriented and column-oriented.

Benefits

Comparisons between row-oriented and column-oriented systems are typically concerned with the efficiency of hard-disk access for a given workload, as seek time is very long compared with the other delays in computers. Sometimes, reading a megabyte of sequentially stored data takes no more time than one random access.[3] Further, because seek time is improving slowly relative to CPU power (see Moore's law), this focus will likely continue on systems that rely on hard disks for storage. The following over-simplified observations attempt to paint a picture of the trade-offs between column- and row-oriented organizations. They apply unless the application can be reasonably assured to fit most or all of its data into memory, in which case huge optimizations are available from in-memory database systems.

  1. Column-oriented systems are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data (see the sketch after this list).
  2. Column-oriented systems are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.
  3. Row-oriented systems are more efficient when many columns of a single row are required at the same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk seek.
  4. Row-oriented systems are more efficient when writing a new row if all of the column data is supplied at the same time, as the entire row can be written with a single disk seek.
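
As a rough illustration of observation 1, the following Python sketch simulates a row store and a column store as byte strings (the layout and the helper names are invented for this example) and shows that an average over Salary only needs to read the Salary column's bytes:

    rows = [
        (1, "Smith", "Joe", 40000),
        (2, "Jones", "Mary", 50000),
        (3, "Johnson", "Cathy", 44000),
    ]
    column_names = ["EmpId", "Lastname", "Firstname", "Salary"]

    # Row store: one contiguous region holding whole records.
    row_bytes = "\n".join(",".join(str(v) for v in r) for r in rows).encode()

    # Column store: one contiguous region per column.
    column_bytes = {
        name: ",".join(str(r[i]) for r in rows).encode()
        for i, name in enumerate(column_names)
    }

    # AVG(Salary): the column store only has to read the Salary region.
    salaries = [int(s) for s in column_bytes["Salary"].decode().split(",")]
    print(sum(salaries) / len(salaries))                      # ~44666.67
    print(len(column_bytes["Salary"]), "of", len(row_bytes))  # bytes read vs. a full scan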

In practice, row-oriented architectures are well suited to OLTP-like workloads, which are more heavily loaded with interactive transactions. Column stores are well suited to OLAP-like workloads (e.g., data warehouses), which typically involve a smaller number of highly complex queries over all the data (possibly terabytes). However, there are a number of proven row-based OLAP RDBMSs that handle terabytes, or even petabytes, of data, such as Teradata.

Compression

Column data is of uniform type; therefore, there are some opportunities for storage size optimizations available in column-oriented data that are not available in row-oriented data. For example, many popular modern compression schemes, such as LZW or run-length encoding, make use of the similarity of adjacent data to compress. While the same techniques may be used on row-oriented data, a typical implementation will achieve less effective results.[4][5]

To improve compression, several implementations (such as Vertica) sort the rows. For example, using bitmap indexes, sorting can improve compression by an order of magnitude.[6] To maximize the compression benefits of the lexicographical order with respect to run-length encoding, it is best to use low-cardinality columns as the first sort keys.[7] For example, given a table with columns sex, age, name, it would be best to sort first on the value sex (cardinality of two), then age (cardinality of <150), then name.
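
The following Python sketch illustrates the point (it is not any particular product's compression format): sorting a low-cardinality column collapses its values into a few long runs, which run-length encoding can represent compactly:

    from itertools import groupby

    def run_length_encode(values):
        """Collapse consecutive equal values into (value, run_length) pairs."""
        return [(value, len(list(group))) for value, group in groupby(values)]

    sex_column = ["M", "F", "F", "M", "F", "M", "F", "M"]

    print(run_length_encode(sex_column))          # unsorted: 7 runs, little gain
    print(run_length_encode(sorted(sex_column)))  # sorted: [('F', 4), ('M', 4)]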

Columnar compression reduces disk space at the expense of retrieval efficiency. Retrieving all of the data for a single row is more efficient when that data is located in a single place, as in a row-oriented architecture. Further, the greater the compression of adjacent values, the more difficult random access may become, as data may need to be decompressed before it can be read. Therefore, column-oriented architectures are sometimes enriched with additional mechanisms aimed at minimizing the need to access compressed data.[8]
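
As a simplified sketch of this random-access cost (not a real storage engine), fetching the value at a given row position in a run-length-encoded column requires walking the runs until the right one is found:

    def rle_lookup(runs, i):
        """Return the value at logical row position i of a run-length-encoded column."""
        position = 0
        for value, length in runs:
            if position + length > i:
                return value       # row i falls inside this run
            position += length     # otherwise skip the whole run and keep scanning
        raise IndexError(i)

    runs = [("F", 4), ("M", 4)]    # the encoded, sorted sex column from the sketch above
    print(rle_lookup(runs, 5))     # 'M', found only after walking the earlier run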

Implementations

Column stores or transposed files have been implemented from the early days of DBMS development, beginning in the 1970s. For example, Statistics Canada implemented the RAPID system[9] in 1976 and used it for processing and retrieval of the Canadian Census of Population and Housing as well as several other statistical applications. RAPID was shared with other statistical organizations throughout the world and used widely in the 1980s. It continued to be used by Statistics Canada until the 1990s.

For many years, Sybase IQ was the only commercially available product in the column-oriented DBMS class. However, that has changed rapidly in recent years with many open-source and commercial implementations.

Current examples of column-oriented DBMSs include:

Commercial
Free and open source software

References

  1. ^ A decomposition storage model, Copeland, George P. and Khoshafian, Setrag N., SIGMOD '85, 1985.
  2. ^ C-Store: A column-oriented DBMS, Stonebraker et al., Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005
  3. ^ The Star Schema Benchmark and Augmented Fact Table Indexing, Pat & Betty O’Neil, Xuedong Chen and Stephen Revilak, TPC Technology Conference 8/24/09
  4. ^ D. J. Abadi, S. R. Madden, N. Hachem, Column-stores vs. row-stores: how different are they really?, in: SIGMOD’08, 2008, pp. 967–980.
  5. ^ N. Bruno, Teaching an old elephant new tricks, in: CIDR ’09, 2009.
  6. ^ Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering 69 (1), pages 3-28, 2010.
  7. ^ Daniel Lemire and Owen Kaser, Reordering Columns for Smaller Indexes, Information Sciences 181 (12), 2011
  8. ^ Brighthouse: an analytic data warehouse for ad-hoc queries, Slezak et al., Proceedings of the 34th VLDB Conference, Auckland, New Zealand, 2008
  9. ^ A DBMS for Large Statistical Databases, Turner, Hammond, Cotton, Proceedings of VLDB 1979, Rio de Janeiro, Brazil